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Abstract 

Probability is often counter-intuitive, and it always involves a great 
deal of math. This is unfortunate, because many applications in 
robotics and AI increasingly rely on probability theory. We intro¬ 
duce a modular toolkit for constructing probability monads, and 
show that it can be used for everything from discrete distributions 
to weighted particle filtering. This modular approach allows us to 
present a single, easy-to-use API for working with many kinds of 
probability distributions. 

Our toolkit combines several existing components (the list 
monad, the Rand monad, and the MaybeT monad transformer), 
with a stripped down version of WriterT Prob, and a new monad 
for sequential Monte Carlo sampling. Using these components, we 
show that MaybeT can be used to implement Bayes’ theorem. 
We also show how to implement a monad for weighted particle 
filtering. 

Categories and Subject Descriptors D.3.3 [Programming Lan¬ 
guages ]: Language Constructs and Features—Control structures; 
G.3 [Probability and Statistics] 

General Terms Probability, Monads 
Keywords Bayesian filtering, particle filters 

A very senior Microsoft developer who moved to Google 
told me that Google works and thinks at a higher level of 
abstraction than Microsoft. “Google uses Bayesian filtering 
the way Microsoft uses the if statement,” he said. -Joel 
Spolsky 

1. Introduction 

Probability is notoriously tricky and counter-intuitive. It’s easy to 
ignore prior probabilities, confuse P(A\B) with P(B\A), or get 
lost while trying to generalize Bayes’ theorem. But we encounter 
probabilities more than ever, thanks to recent trends in search 
algorithms, robotics and artificial intelligence. 

To address these challenges, researchers have built several ex¬ 
cellent programming languages based on probability distribution 
monads (30j[27][]D- Some of these languages use random sampling; 
others compute exact results. But all of these languages are delight¬ 
ful tools—they make previously subtle problems intuitive and easy. 

But these waters are deeper than a casual glance reveals. Many 
problems in probability theory can be expressed in terms of mon- 
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ads nsnnma. And if we examine a few such monads, certain 
repeated patterns become obvious. In fact, most probability distri¬ 
bution monads can be built from a small “toolkit” of monads and 
monad transformers. The same parts are shared by discrete distri¬ 
butions, random sampling monads, and even particle filters. 

In general, the monads built from this toolkit make pleasant pro¬ 
gramming languages. For example, imagine that we have an in¬ 
fluenza test with a 307, false-positive rate, and a 107, false-negative 
rate (2) . If we assume that 107, of the population has influenza, what 
is the probability that someone with a positive test result is actually 
infected? Instead of messing around with Bayes’ theorem, we can 
simply write: 

fluStatusGivenPositiveTest = do 
fluStatus <— percentWithFlu 10 
testResult <— if fluStatus = Flu 

then percentPositive 70 
else percentPositive 10 
guard (testResult = Pos) 
return fluStatus 

This function will return a probability distribution of 447, Flu and 
567, Healthy. Because only a small portion of the population is 
actually infected, the false positives actually outnumber the true 
ones. This monad in this example is similar to other implemen¬ 
tations of the discrete probability monad (8), with the addition of 
guard, which is used to do implicit Bayesian filtering. 

We make the following contributions: 

• We introduce a toolkit for constructing probability distribution 
monads (Section [3j. The toolkit consists of three monads (the 
list monad, a Monte Carlo monad, and a sequential Monte 
Carlo monad), and two monad transformers ( PerhapsT and 
MaybeT). We use this toolkit to recreate an existing probability 
distribution monad (Sections |3.2| and|5l>. 

• We implement Baye s’ th eorem using the MaybeT monad 
transformer (Sections |3.3| and|6). MaybeT allows us to discard 
possible outcomes, neatly encapsulating the notion of sampling 
with rejections. 

• We develop several new monads for sequential Monte Carlo 
sampling, also known as particle filtering (Sections |3.4| and|7). 
These monads support both rejection filters (using MaybeT) 
and weighted filters (using PerhapsT). 

2. Background 

Probability theory poses many challenges for programming lan¬ 
guage designers. We must choose from a bewildering variety of 
representations and techniques. We must work around the limita¬ 
tions of the human intuition, which is notoriously bad at probabil¬ 
ity. And we must keep the math from obscuring the actual problems 
of interest. Solving all these problems at once is beyond the scope 
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of any existing programming language. At best, we can hope to find 
a sweet spot in the design space. 

2.1 Choosing a representation 

Probability theory offers a rich variety of problem-solving tools. At 
times, this variety is perhaps too rich. For example, probability dis¬ 
tributions may be represented as discrete distributions (8), random 
sampling functions l30l l27l [§), measure terms l30l . Kalman fil¬ 
ters CEDED, multi-hypothesis Kalman filters ED, or particle filters 
flQl l6l. Similarly, Bayes’ theorem may be implemented using ex¬ 
plicit calculations, the rejection method (30l . importance sampling 
(30), weighted particle filters, or more sophisticated techniques. 

Of course, each of these choices comes with tradeoffs. Discrete 
distributions offer exact answers, but they require exponential time 
to solve certain problems. Sampling functions can represent arbi¬ 
trary distributions, but they can only calculate approximate distri¬ 
butions. Kalman filters are extremely efficient, but they treat all 
distributions as simple Guassians. 

To further complicate matters, many of these techniques are 
traditionally described in ways that blur the underlying algebraic 
connections. For example, the various implementations of Bayes’ 
theorem have quite a lot in common. But those similarities are hard 
to see if we compare the textbook formula for Bayes’ theorem (3D 
with a description of weighted particle filters (Tol . 

In an ideal world, we would be able consolidate as many of 
these techniques as possible under a single programming interface, 
making the algebraic connections obvious. 

2.2 Coping with faulty intuitions 

Probability is often counter-intuitive. Even physicians, who work 
with diagnostic tests on a daily basis, commonly make order-of- 
magnitude errors in interpretation. In an informal study by Eddy, 
most physicians concluded that a patient with an apparently benign 
breast mass but a positive mammogram had a 75% chance of can¬ 
cer. The actual chance of cancer, as calculated by Bayes’ theorm, 
was only 7.7% Q. 

Other probability puzzles are similarly misleading. In a famous 
Parade Magazine article, Marilyn vos Savant described the “Monty 
Hall” problem, in which contestants must choose from several 
doors, one of which hides a prize m . Most of vos Savant’s readers 
chose the wrong door, thanks to subtle ambiguities in the problem 
statement and the use of inappropriate heuristics (23). 

To correctly answer a question about probability, we must first 
specify the details. In particular, we must know the starting condi¬ 
tions, the protocols followed by any agents in the puzzle, and the 
exact values we want to compute. This kind of precise specification 
is well-suited to a programming language. 

2.3 Separating the problem from the math 

Once we’ve decided how to represent a problem, and specified 
it precisely, we still need to do the math. But the math itself is 
frequently a barrier—instead of focusing on the actual problem, we 
often wind up trying to remember which version of Bayes’ theorem 
applies in our particular case. Even worse, this preoccupation with 
formulas can blind us to simpler ways of looking at the problem. 

Consider the influenza test in Section E We know that the test 
has a 30% false-positive rate and a 10% false-negative rate. If 10% 
of the population has influenza, we know that 

P{I) = 0.1 P(+|/) = 0-7 P(+H) = 0-1 

where P(I) is the probability that a patient has influenza, P(+|7) 
is the probability of a true positive, and P(+\-<I) is the probability 
of a false positive. We can plug these numbers into Bayes’ theorem, 
and calculate P(ij+), the probability that patient has influenza, 


PerhapsT [] 

MaybeT (PerhapsT []) 

MC 

MaybeT MC 
PerhapsT MC 


Lists of outcomes 
Discrete distributions 
... with rejection 

Monte Carlo sampling 
... with rejection 
... with weights 


SMC 

MaybeT SMC 
PerhapsT SMC 


Sequential Monte Carlo sampling 
... with rejection 
... with weights 


Table 1. The probability monad toolkit. 


given a positive test result. 

P m +) = _ P(+\i)Pd) _ = X 

1 p(+|j)p(j) + p(+hj)p(-ij) i6 

But we can solve this problem more easily using a visual ap¬ 
proach. Assuming we have 100 patients, we can separate them into 
10 patients with influenza, and 90 healthy patients. If we then rep¬ 
resent positive test results with “+”, our population looks like: 

Influenza T + 4 + 4- -F .+ 6 # n 
Healthy + 000000000 




We can see that 7/16ths of positive test results occur in patients 
with influenza. The ease with which we can read answers off this 
diagram represents our ideal; any programming language should be 
as straightforward. 


3. The probability monad toolkit 

We are now ready to introduce our toolkit for building probability 
monads. Our toolkit relies on two main ideas: 

1. Monads allow us to split a computation into two parts: the main 
program, and the bookkeeping details (22] (36). In the main 
program, we describe our problem in high-level terms, leaving 
out most of the math. But behind the scenes, a monad keeps 
track of the math for us, calculating probabilities and applying 
Bayes’ theorem. 

2. Monad transformers allow us to start with base monads, and 
layer on extra features as needed (2l][l7][34][9). Our base mon¬ 
ads represent simple ideas: lists, sampling functions, and sets 
of particles. Everything else—including the probability calcu¬ 
lations, the weights of the particles, and the implementation of 
Bayes’ theorem—is supplied by one or more monad transform- 


Using only three base monads and two monad transformers, we are 
able to construct a wide range of probability distribution monads 
(Table [TJ. 
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[ ] (list monad) 


PerhapsT [] 


MaybeT (PerhapsT []) 


(Flu, Pos) 
(Flu, Neg) 
(Healthy, Pos) 
(Healthy, Neg) 


T! (Flu, Pos) 

3% (Flu, Neg) 

9% (Healthy, Pos) 
81% (Healthy, Neg) 


7% Just (Flu, Pos) 

3% Nothing 

9% Just (Healthy, Pos) 
81% Nothing 


Table 2. Building a monad in layers, with example data. 


3.1 The list monad 

The list monad allows us to combine elements from several lists, 
generating every possible outcome (36). Continuing with our in¬ 
fluenza example, we define two data types: 
data Status = Flu \ Healthy 
data Test = Pos \ Neg 

Using Haskell’s do-notation, we pick one item from each list and 
return the result. In this example, the <— symbol should be read as 
“pick one item from.” 

outcomes :: [(Status, Test)] 
outcomes = do 
status <— [Flu, Healthy] 
test <— [Pos, Neg] 
return (status, test) 

The type declaration [(Status, Test)] is important, because it tells 
Haskell to interpret the do-body as a computation in the list monad. 
When we run this code, it will return a list of every possible 
outcome: 

(Flu, Pos) (Flu, Neg) 

(Healthy, Pos) (Healthy, Neg) 

Whenever the list monad encounters , it makes every possible 
choice, backtracking as necessary. The list monad also appears 
in Haskell and other languages as a list comprehension, a special 
syntax for building lists (36): 

[(status, test) \ status *— [Flu, Healthy], 
test <— [Pos, Neg]] 

For more information on the list monad, see Section |4~T1 

3.2 The PerhapsT monad transformer 

By itself, the list monad has no way to keep track of probabilities. 
We can fix this using the PerhapsT monad transformer, 
type DDist = PerhapsT [] 

PerhapsT takes an existing monad, and attaches a probability to 
each value in the computation. The probabilities are tracked invisi¬ 
bly in the background, giving us a discrete probability distribution: 

7% (Flu, Pos) 3% (Flu, Neg) 

9% (Healthy, Pos) 81% (Healthy, Neg) 

The code that generates this distribution is similar to our previous 
example. We replace the lists with calls to weighted, which con¬ 
structs weighted distributions: 

weightedOutcomes :: DDist (Status, Test) 
weightedOutcomes = do 

status <— weighted [(10, Flu), (90, Healthy)] 
test <— 
if (status = Flu) 

then weighted [(70, Pos), (30, Neg)] 
else weighted [(10, Pos), (90, Neg)] 
return (status, test) 


When run, weightedOutcomes returns a list of possible results 
and their probabilities, as shown in the table above. Note that the 
final DDist monad is equivalent to Erwig and Kollmansberger’s 
Dist monad (8). For more information on the PerhapsT monad 
transformer, see Section[5] 

3.3 The MaybeT monad transformer 

In Section [h3l we implemented Bayes’ theorem by counting up all 
the patients marked “+”, and ignoring the rest. We can get a similar 
effect using the MaybeT monad transformer: 

type BDDist = MaybeT DDist 
MaybeT takes an existing monad, and replaces each value of type 
a with either Just a or Nothing (T). The Nothing values represent 
failed “branches” in the computation, values of no interest to us. 
Filtering for test = Pos, we get: 

7% Just (Flu, Pos) 3% Nothing 

9% Just (Healthy, Pos) 81% Nothing 

At the end of the computation, we can discard all the Nothing 
values, and scale the remaining percentages so that th ey su m to 
100%. Compare this example to the diagram in Section [23] Both 
are implementations of Bayes’ theorem using the rejection method 
and a normalization factor (27) [33. 

The filtered WeightedOutcomes function is identical to our 
earlier weightedOutcomes, except for the type declaration and the 
guard function on the second-to-last line: 

filteredWeightedOutcomes :: BDDist (Status, Test) 
filtered WeightedOutcomes = do 
status <— ... 
test <— ... 
guard (test = Pos) 
return (status, test) 

The guard function checks to see if test = Pos, and if not, 
replaces the current branch of the computation with Nothing. 

See Table[2]for a step-by-step breakdown of how the PerhapsT 
and MaybeT monad transformers are used to construct BDDist. 
For more information on the MaybeT monad transformer, see Sec¬ 
tion [6) 

3.4 The MC and SMC monads 

Monte Carlo algorithms rely on random sampling to compute ap¬ 
proximate answers (20) . We implement random sampling using 
the MC monad, described under various names by previous re¬ 
searchers (30][27] [8). The example code for the MC monad is iden¬ 
tical to weightedOutcomes, except for the type declaration: 
sampledOutcomes :: MC (Status, Test) 
sampledOutcomes = ... 
sample sampledOutcomes 10 

The sample function runs our code repeatedly, collecting the re¬ 
sults in a list. 

Sequential Monte Carlo sampling, also known as particle fil¬ 
tering, differs from ordinary sampling in that all our samples pass 
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Figure 1. Monte Carlo sampling (left) and sequential Monte Carlo 
sampling (right). In the latter case, we represent our samples with a 
cloud of particles, which travel as a group through each step x, y, z 
of the computation. 


through each step of the computation as a group (To] |6). See Fig¬ 
ure |3.4| for a comparison of the two approaches. We can perform 
sequential Monte Carlo sampling using the SMC monad. 

The MC and SMC monads can also be combined with our 
monad transformers (Table[l]l. For example, we can use PerhapsT 
to construct a weighted version of the SMC monad: 

type WSMC = PerhapsT SMC 
Using the WSMC monad, we can replace our earlier calls to guard 
with conditional probabilities. For example, if we know a patient’s 
test result is positive, we can write: 

statusGivenPosResult :: WSMC Status 
statusGivenPosResult = do 

status <— weighted [(10, Flu), (90, Healthy )] 
if (status = Flu) 
then applyProb 0.7 
else applyProb 0.1 
return status 

This is a classic weighted particle filter flOl . For more information 
on the MC and SMC monads, see Sections [43| and[7l respectively. 


4. Monads and probability in Haskell 

So far, our presentation of probability distribution monads has been 
informal. Now we provide the theory that supports the toolkit, be¬ 
ginning with a short introduction to monads and monad transform- 


4.1 Monads 

In Haskell, a type class describes an abstract interface, which may 
be implemented by one or more actual types. The Monad type 
class specifies two functions that must be defined by every monad 
m (36] [28). It also constrains m to be a type constructor with a 
single argument: 

class Monad m where 
return :: a —> m a 
(»=) m a —y (a —y m b) —> m b 
The return function takes a value of type a, and constructs a new 
value “in the monad,” that is, a new value of type m a. The »= 
operators a bit trickier. It is best understood as two operations, 
liftM and join: 

liftM :: (Monad m) =*• (a -» b) -> m a -> m b 

join :: (Monad m) =>■ m (m a) —y m a 

1 Read as “bind.” 


ma^=f= join (liftM f ma) 

The liftM function is analogous to the standard map function. 
Given a function of type a —> b, and value of type rn a, it reaches 
inside m a, and replaces a with b. Similarly, the join function is 
analogous to concat. It takes a nested value of type m (m a), and 
collapses into a value of type m a. 

For example, Haskell’s standard list type forms a monad. Note 
the use of concat and map in place of join and liftM: 
instance Monad [] where 
return x = [a;] 

ma 3= / = concat (map f ma) 

Haskell’s do-notation is syntactic sugar for the >*= operator. It 
expands as follows: 
doi<- [1,2] 
return (x * 3) 

[1,2] :»= (Xx —y return (x * 3)) 

These two expressions are equivalent, and both return [3,6]. 

4.2 Probabilities and distributions 

We represent a probability as a rational number, allowing us to 
work exact probabilities whenever possible, 
newtype Prob = Prob Rational 

deriving (Eg, Ord, Num, Fractional) 

The deriving clause automatically implements the specified type 
classes for us, based on the implementations for Rational. 

The Dist type class specifies the interface that must be imple¬ 
mented by a distribution type d. 

class (Monad d) => Dist d where 
weighted :: [(Rational, a)] —* d a 
The constraint (Monad d) requires all distributions to be monads. 
The Dist type class requires only one function, weighted, which 
constructs a weighted distribution from a list of weights and values. 

We can define a uniform distribution by assigning each value a 
weight of 1: 

uniform :: Dist d =>■ [a] —> d a 
uniform = weighted o map (\v —> (1, v)) 

For a more realistic Dist interface, see earlier papers by Park and 
colleagues (27) and Erwig and Kollmansberger (8). 

4.3 The MC monad: An example 

We can turn the proposed Rand monad Q] into a probability 
distribution by defining a Dist instance for it. This corresponds to 
the sampling monads which have appeared in a number of earlier 
papers (30] E7] ID- 

type MC = Rand StdGen 
instance Dist MC where 

weighted wvs = fromList (map flip Pair wvs) 
where flipPair (a, b) = (b, a) 

We create a type alias MC, which refers to Rand StdGen. The 
parameter StdGen specifies what source of random numbers will 
be used, and the fromList function makes a weighted selection 
from a fist. We also define a sample function: 

sample :: MC a —► Int —> MC [a] 
sample r n = sequence (replicate n r) 

The standard sequence function converts a value of type [m a] 
into a value of type m [a]. 
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4.4 Monad transformers 

In denotational semantics, a monad morphism is a mapping from 
one monad to another tm Monad morphisms take an underlying 
monad, and add features such as state, an environment, or contin¬ 
uations. Less formally, monad morphisms may be thought of as 
type-level functions from one monad to another. 

In Haskell, we can achieve the same result using a monad trans¬ 
former CMIIIS). To define a monad transformer ExampleT, we 
must implement both return and >■= in terms of m, the monad to 
be transformed: 

instance (Monad m) => Monad (ExampleT m) where 
return — ... 
ma^f = ... 

We must also provide an implementation of lift, which maps values 
from a monad m into its transformed counterpart t m. 

class MonadTrans t where 
lift :: Monad m=$-ma^tma 

For examples of monad transformers, see Sections [53| and[6| 

5. Monoids and discrete distributions 

In this section, we introduce monoids, M- sets, and the MVT 
monad transformer. Using these pie ces, we build the DDist dis¬ 
tribution described in Section |3.2| Special thanks go to Dan 
Piponi, for inspiring the treatment of M- sets in Section |5.2| and 
to Cale Gibbard, for first noticing the decomposition of DDist into 
WriterT Prob []. 

5.1 Monoids 

A monoid is a triple (M, e, ®), where M is a set, ® is an as¬ 
sociative binary operation, and e is an identity element such that 
e®2i = a:®e = a;.In Haskell, we can represent a monoid using 
the built-in Monoid type class m 

class Monoid a where 
mempty :: a 
mappend :: a —> a —> a 

Here, mempty is e, and mappend is ®. 

Probabilities form a monoid (P, 1, x), where P is the set of 
all real numbers between 0 and 1 inclusive, and x is ordinary 
multiplication. In Haskell, we write this as: 

instance Monoid Prob where 
mempty = 1 

pi ‘ mappend 1 p2 = pi * P2 

5.2 M-sets and the MV monad 

An M-set is a set X with a monoid action (■), such that 

e-x = x (1) 

mi • (m 2 • x) = (mi <g> m 2 ) ■ x (2) 

where x € X, and rn, n £ M. A free M-set is any M-set where 
m • x = x only if m = e. (3) 

For an equivalent definition of M- sets, see Kilp and colleagues [16] 
pp. 43—44,68]. 

Given a set S, define P x S to be the set of pairs (p, s) such that 
p € P and s 6 S. In other words, the set P X S corresponds to 
elements of S annotated with probabilities. Now define our monoid 


action to be pi • (p 2 , s) = (pip 2 , s). We can easily see that 
1 • (p, s) = (p, s) 

Pi ■ (P2 • (P3, s)) = (P1P2P3, a) 

= (pip 2 ) • (p 3 ,s) (5) 

pi • (P2, s) = (p 2 , s) only if pi = 1. (6) 

From this, we conclude that P X S is a free M-set. 

In Haskell, we can represent an element of an M-set using the 
type MV, which contains a monoid and a value, 
data (Monoid w) => MV w a = 

MV{mvMonoid :: w,mvValue :: o} 

The type MV Prob a corresponds to the set P x S above. 
Interestingly, we can make MV a monad. 
mapMV f (MV w v) = MV w (f v) 
joinMV (MV wi (MV w 2 v)) = 

MV (w l l mappend i w 2 ) v 
instance (Monoid w) =>■ Monad (MV w) where 
return v = MV mempty v 
mv^f = joinMV (mapMV f mv) 

The return function corresponds to a map x i—> (e, a;), and 
mempty to the monoid identity e. The joinMV function corre¬ 
sponds to our monoid action (•). 

Haskell programmers will recognize the MV monad as a 
stripped-down version of the standard Writer monad mi We 
omit the listen and pass functions, which are useless in this 
context. We also omit the tell function, which we replace with 
applyProb: 

class (Monad m) => MonadCondProb m where 
applyProb :: Prob —> m () 
instance MonadCondProb (MV Prob) where 
applyProb p = MV p () 

We can use applyProb and MV to multiply a set of independent 
probabilities together. For example, if we have an influenza test 
with a 30"/, false negative rate, and we know that 10’/, of the popu¬ 
lation has influenza, we can calculate the actual percentage people 
receiving a false positive: 

flu — applyProb 0.1 

falseNegative = applyProb 0.3 
missedFluCases :: MV Prob () 
missedFluCases = do 
flu 

falseNegative 

The MV monad multiplies 0.1 and 0.3 behind the scenes, and re¬ 
turns MV 0.03 (). This tells us that 3’/. of the population will have 
influenza and incorrectly receive a negative test result. Conceptu¬ 
ally, missedFluC ases corresponds to a single “path” through the 
example in Section [V2] picking out only the case (Flu, Neg). Note 
that MV Prob is not a probability distribution monad! At best, it 
represents a single element of a probability distribution. 

5.3 The MVT monad transformer 

The MV monad is only marginally useful by itself. We need some 
way to combine the features of MV with other monads. We do this 
by defining MVT, a stripped-down version of the WriterT monad 
transformer fj~5l . 

newtype (Monoid w, Monad m) =>■ MVT w m a = 
MVT{runMVT :: m (MV w a)} 
instance (Monoid w ) =>- 
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MonadTrans (MVT w ) where 
lift mv = MVT (do v ^ mv 

return (MV mempty v)) 

The trick is to take our underlying monad m a, and replace every 
occurrence of a with MV w a. The runMVT function simply ex¬ 
tracts the value packed inside our MVT wrapper. The lift function 
will take an existing value of type m a, and promotes it to a value 
of type MVT w a. 

The Monad instance for MVT is closely modelled on that of 
MV. 

instance (Monoid w, Monad m) =>• 

Monad (MVT w m) where 
return = lift o return 
ma »= / = MVT bound 
where bound = do 

(MV wivi)<- runMVT ma 
(MV w 2 v 2 ) <- runMVT (f ui) 
return (MV (wi ‘ mappend ‘ w 2 ) v 2 ) 

We define the return function for MVT in terms of lift and the 
underlying monad’s return. The bind function is a bit trickier. We 
need runMVT ma to get a value of type m (MV w a), and 
use <— to strip off m. The final two lines perform the same work 
as mapMV and bindMV, respectively, but do so in the context of 
our underlying monad. 

5.4 The DDist monad 

Now we’re finally ready to build our discrete distribution monad. 
We define PerhapsT to be an alias for MVT Prob, and apply it 
to the standard list monad. 

type PerhapsT = MVT Prob 
type DDist = PerhapsT [] 

The list monad, as we saw in Section |3.1| follows every possible 
“branch” of a computation. When we apply PerhapsT to the list 
monad, we associate a probability with each branch. To split a 
branch into sub-branches, we use weighted. 

instance Dist DDist where 

weighted wvs = MVT (map toMV wvs) 
where toMV (w, v ) = MV (Prob (w / total)) v 
total = sum (map fst wvs) 

Note that we use total to normalize the weights, forcing them 
to add up to 1. This not only ensures that weighted returns a 
valid probability distribution, it also guarantees that the weights 
calculated by ;§= for our sub-branches will add up to the weight of 
our original branch. 

The DDist monad is a stripped-down version of Erwig and 
Kollmansberger’s probability monad (8). The factoring of DDist 
into WriterT Prob [] has been independently discovered by 
several people, including Cale Gibbard. 

6. Bayes’ theorem and MaybeT 

In Sections |2,3| and |3.3| we used Bayes’ theorem (5) to interpret 
the result of an influenza test. In this section, we show how to 
implement Bayes’ theorem using the MaybeT monad transformer. 

6.1 Bayes’ theorem 

We adopt the convention that P(A) specifies a vector of probabili¬ 
ties, one for each possible value of A. 

P(A) = {P(A = ai),...,P(A = a n )) (7) 


Two such vectors may be multiplied pointwise. 


P (B = b\A)P(A) = (P(B = b\A = ai)P(A = m),..., 

P(B = b\A = a„)P(A = a„)) (8) 


Using this notation, we can state Bayes’ theorem (3T] p. 479] as 


P (B = 6|A)P(A) 


J2 i P(B = b\A = a i )P(A = a i ) 
Now recall the program from Section [T3| 


filteredWeightedOutcomes :: BDDist ( Status, Test) 
filteredWeightedOutcomes = do 

status <— weighted [(10, Flu), (90, Healthy )] 
test <— ... 
guard (test = Pos) 
return (status, test) 


(9) 

( 10 ) 


This returns a list of possible results, and their corresponding prob¬ 
abilities: 


7'/, Just (Flu, Pos) 3’/, Nothing 

9’/. Just (Healthy, Pos) 817, Nothing 


The Nothing values represent those branches of our computation 
on which guard (test = Pos) failed. 

We can apply Bayes’ theorem to this example as follows. Let 
P(A) = {P(A = ai ),...,P(A = a n )) (11) 


where a\,..., a n are the results of each branch of our computation. 
Let G = gi A • • • A g m , where g, is true if and only if our 7-th guard 
clause succeeds. This gives us 


P(A|G) = 


P(G|A)P(A) 

Y, i P(G\A = a i )P(A = a i ) 


( 12 ) 


But for each branch at of our computation, G is true if and only if 
m ^ Nothing. This gives us 


allowing 


P(G\A = at) = 


1 if ai ^ Nothing 
0 if m = Nothing 


us to simplify GU to 
P(,4|G) = £ 


P(G|A)P(A) 

*Not h in g P(A = ai) 


(13) 


(14) 


But this is equivalent to discarding all the Nothing terms, and 
normalizing the remaining probabilities. Compare this result to the 
diagram in Section [23] 


6.2 The MaybeT monad transformer 

To build our Bayesian monad, we need an implementation of 
MaybeT [l]. The MaybeT monad transformer lifts a computa¬ 
tion of type matoa computation of type m (Maybe a). 
newtype MaybeT m a = 

MaybeT{runMaybeT :: m (Maybe a)} 
instance MonadTrans MaybeT where 
lift x = MaybeT (liftM Just x) 

Here, liftM Just has the type m a —> m (Maybe a), where m is 
the monad being lifted. 

We also need to define new versions of return and >*= using 
the versions defined by m. 

instance (Monad m) => Monad (MaybeT m) where 
return x = lift (return x) 
ma / = MaybeT (runMaybeT ma ~^*=f') 
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where /' Nothing = return Nothing 

f (Just x) = runMaybeT (f x) 

There are two key concepts here: 

1. Neither return or create new Not hing values. This must 
be done using mzero, defined in Section[63] 

2. Once a Nothing value is inserted into the monad, it will be 
passed along unchanged by suppressing all further calls to 
any function /. 

Note that we provide no MonadPlus instance for MaybeT, be¬ 
cause it has ambiguous semantics. It could refer to either 

instance (MonadPlus m) =£- MonadPlus (MaybeT m) 
which lifts the semantics of an underlying MonadPlus instance, or 
instance (Monad m) => MonadPlus (MaybeT m) 
which provides semantics similar to MonadPlus Maybe. There¬ 
fore, we leave the choice up to the user of MaybeT. 

6.3 The BDDist monad 

Now we are ready to define BDDist, a discrete distribution monad 
with support for Bayes’ theorem. We apply MaybeT to DDist, and 
lift the underlying weighted function into the new monad, 
type BDDist = MaybeT DDist 
instance (Dist d) Dist (MaybeT d) where 
weighted wvs = lift (weighted wvs) 

The guard functiorj^] is actually supplied by the standard Haskell 
library (28): 

guard :: MonadPlus m => Bool —> m () 

guard cond = if cond then return () else mzero 

If cond is true, then guard returns (). This continues the current 
branch of the computation unchanged. But if cond is false, guard 
returns mzero. This has the effect of injecting Nothing into the 
computation, killing off the current branch. 

To take advantage of guard, we need to make BDDist an 
instance of MonadPlus. Haskell also forces us to supply mplus, 
which we don’t need in this paper. 

instance MonadPlus BDDist where 
mzero = MaybeT (return Nothing) 
di l mplus‘ d® = ... 

We now have a monad which supports guard, and returns values 
of type Maybe a. We want to turn that Maybe a back into a, 
keeping the Just values and discarding the Nothing values. We 
can do this using catMaybes'. 

catMaybes' :: (Monoid w) =>■ 

[MV w (Maybe a)] -*• (MV w a] 
catMaybes' = map (liftM from Just) o 
filter (isJust o mvValue) 

Now we’re ready to impl emen t Bayes’ theorem, following the 
strategy outlined in Section |6J| We use catMaybes' to discard all 
the Nothing values, and sum the remaining probabilities into total. 

bayes :: BDDist a —> Maybe (DDist a) 
bayes bfd 

| total = 0 = Nothing 

| otherwise = Just (weighted (map unpack events)) 


2 Earlier versions of this paper used a condition function. Thank you to 
David House for noticing that condition was identical to guard. 


where 

events = catMaybes' (runMVT (runMaybeT bfd)) 
total = sum (map mvMonoid events) 
unpack (MV (Prob p) v) = ( p, v) 

If total = 0, then our guard conditions have failed on every 
possible path, and we return Nothing. Otherwise, we construct 
a discrete probability distribution, using weighted to perform the 
actual normalization. 

6.4 The BMC monad 

We can use MaybeT to define a second Bayesian monad, this one 
based on rejection sampling (27). Again, we omit the details of 
mplus. 

type BMC = MaybeT MC 
instance MonadPlus BMC where 
mzero = MaybeT (return Nothing) 
di l mplus‘ d2 = ... 

The BMC monad returns samples of type Maybe a. Once again, 
we want to keep the Just values and discard the Nothing values. 
We can do that using the sampleWithRejections function, which 
takes n samples from a distribution, rejects those samples equal 
to Nothing, and returns the rest. These remaining samples are the 
ones that made it by all of our guard conditions. 

sampleWithRejections :: BMC a —> Int —> MC [a] 
sampleWithRejections d n = 

(liftM catMaybes) (sample (runMaybeT d) n) 

Note that this function may return far fewer than n samples. This 
is because the distribution d, in a worst-case scenario, may never 
produce any samples at all. This will occur with the distribution 
guard False, and other distributions with impossible-to-satisfy 
guard conditions. Enhanced versions of sampleWithRejections 
must take care not to hang in these circumstances. 

7. Sequential Monte Carlo sampling 

Sequential Monte Carlo sampling is used in robot localization, 
computer vision, signal processing and econometrics (TOl [6). It 
differs from regular Monte Carlo sampling in that it represents 
samples as sets of “particles,” the values of which are updated over 
time. In a typical application, each particle might represent one 
possible location of a robot. Initially, the particles are positioned 
at random. As the robot moves, the particles are moved along with 
it (perhaps with some random variation, if the speed of the robot 
is uncertain). If a particle winds up in an impossible location, such 
as inside a wall, that particle can be discarded and a new particle 
allocated elsewhere. The cloud of particles will eventually converge 
on one or more probable locations for the robot. 

The major advantage of sequential Monte Carlo sampling over 
ordinary sampling is the ability to reallocate particles dynamically. 
This allows the algorithm to focus resources on the most interesting 
hypotheses, and therefore reduces the total number of samples 
required. 

7.1 The SMC monad 

We define our SMC monad in terms of MC. Internally, we repre¬ 
sent SMC as a function mapping the desired number of samples to 
a list of actual samples. The list itself must be in the MC monad 
because it can only be generated using random numbers, 
newtype SMC a = 

SMC{runSMC :: Int -► MC [a]} 
liftMC :: MC a —» SMC a 
liftMC r = SMC (sample r) 
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Figure 2. mapSMC and joinSMC. 


We can lift a distribution from MC to SMC simply by taking the 
requested number of samples. The mapSMC function is a bit more 
complex, however. We define a new sampling function, mapped, 
which first turns a distribution d into actual samples, and then 
applies / to each sample. We package mapped inside our SMC 
monad, where it can be run when needed. 

mapSMC :: {a-> b)-+ SMC a -> SMC b 
mapSMC f d = SMC mapped 
where mapped n = liftM (map f) (runSMC d n) 

The joinSMC function is the heart of the SMC monad. We begin 
with a cloud of particles, where each particle is itself an entire 
cloud (Figure [7T| middle column). We want to take a single particle 
from each nested cloud, and put those particles into a new, flattened 

To do this, we define a function joined, which takes n samples 
from the outermost cloud, and stores them in the list ds, which has 
type [SMC a]. It then takes one sample from each of these clouds 
using samplel, and stores them in xss, which has type [[a]]. All 
that remains is to flatten this list using concat. 

joinSMC :: SMC {SMC a) -► SMC a 
joinSMC dd = SMC joined 
where joined n = do 
ds <— (runSMC dd n) 
xss *— sequence (map samplel ds) 
return {concat xss) 
samplel d = runSMC d 1 

By taking only one sample from each of our nested clouds of 
particles, we effectively propagate each particle into one of its own 
possible futures. 

The remaining code is trivial. 

instance Monad SMC where 
return = liftMC o return 
ps »= / = joinSMC {mapSMC f ps) 
instance Dist SMC where 
weighted = liftMC o weighted 

7.2 The WSMC monad and weighted particle filtering 

Sequential Monte Carlo sampling is commonly performed using 
weighted particles. This is a form of importance sampling, where 
each sample has an associated weight |6). A high-importance sam¬ 
ple is interpreted as more “likely” than a low importance sample. 
This allows us to get more data out of a given number samples. In¬ 
stead of discarding samples, as we did using MaybeT in Section[6] 
we can simply mark those samples as unlikely. 


The definition of the WSMC monad should he unsurprising at 
this point. It relies on the same techniques seen in Section [6] with 
only cosmetic differences. 

type WSMC = PerhapsT SMC 
instance Dist WSMC where 
weighted wvs = lift (weighted wvs) 
run WSMC :: WSMC a -> Int -> MC [MV Prob a] 
runWSMC wps n = runSMC {runMVT wps) n 

We do, however, provide two new features. The applyProb func¬ 
tion is a weighted version of guard. It multiplies the current parti¬ 
cle’s weight by p. Recall that p here is a probability, so we know 
that 0 < p < 1. The actual multiplication is handled for us by 
PerhapsT. 

instance MonadCondProb WSMC where 
applyProb p = MVT (return {MV p ())) 

In practice, applyProb is used to apply a new piece of evidence 
to our particles. For example, imagine that we have a robot with a 
door sensor. Let x be the current reading of the door sensor, and 
let P{x\pos) be the probability of observing x at pos. We want to 
call applyProb with an argument of P{x\pos), which will apply 
an appropriate weight to each of our particles. For an excellent 
illustration of this process, see Fox and colleagues flOl . 

As time goes on, however, repeated calls to applyProb will 
leave many of our particles with weights near zero. Periodically, 
we want to replace these particles with higher-probability ones. We 
can do this using the resample function. 

resample :: WSMC a -> WSMC a 
resample d = lift (SMC resampled) 
where resampled n = do 

xs <- runSMC {runMVT d) n 
sample {weighted {map unpack xs)) n 
unpack {MV {Prob p) x) = {p, x) 

The resample function treats our weighted particles as a probabil¬ 
ity distribution, and takes n new samples. This process will favor 
particles with high weights over particles with low weights, con¬ 
centrating most of our particles in places where they’ll do the most 
good. This particular implementation of resampling, however, is 
extremely naive and has statistical problems. See Doucet and col¬ 
leagues for a survey of more sophisticated approaches |6j- 

8. Related work 

Probability distribution monads are discussed by Lawvere (Hand 
Giry mi . Giry’s work focuses on Markov processes and transition 
probabilities, and provides a categorical foundation for probability 
measures. Jones and Plotkin lay further mathematical foundations, 
introducing a probabilistic power domain and showing that it forms 
a monad fl4l . They also provide a A-calculus model for the proba¬ 
bility monad, and propose a probabilistic choice construct serving 
the same purpose as weighted. 

Ramsey and Pfeffer provide Haskell interfaces for several prob¬ 
ability distribution monads, including a sampling monad that cor¬ 
responds roughly to MC l30l . They show that monads offer effi¬ 
cient implementations of support and sampling, but may be expo¬ 
nentially slower than other techniques for calculating expectations. 
To solve this problem, they translate the A-calculus into measure 

Park and colleagues use a sampling monad in a variety of real- 
world robotics applications l27l . Their AQ-calculus uses sampling 
functions to represent binomial, geometric, and Gaussian distri¬ 
butions. They provide two implementations of Bayes’ theorem 
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(one using the rejection method, the other using importance sam¬ 
pling), and a primitive for calculating expectations. This is the best- 
developed version of the MC monad in the literature. 

Erwig and Kollmansberger have written an excellent Haskell 
library supporting both discrete distributions and random sampling, 
corresponding to the DDist and MC monads in this paper l8l . 
They provide a much richer set of higher-order functions than we 
do, including functions for iterated simulation. 

Our work differs from these earlier papers in several ways. 
We focus on the underlying structure of the probability monads, 
showing how to build them from layers of monad transformers. 
We also show that Bayes’ theorem may be expressed naturally 
using MaybeT, removing the need for special primitives. And 
we introduce a new family of monads for sequential Monte Carlo 
sampling. 

8.1 Monad morphisms and monad transformers 

The history of monad morphisms and monad transformers has been 
described in a bibliography by Chung-chieh Shan m Moggi 
originally used monad morphisms to build a modular theory of 
denotational semantics fTTl . Moggi’s theory was adapted for use in 
functional programming languages by King and Wadler fTTl . Later 
work focused heavily on modular interpreters (37] [34] [9| |T9j U3J. 
A typical modular interpreter is based on an eval function in an 
unspecified monad. On top of this foundation, several layers of 
monad transformers are used to implement state, continuations, 
error reporting, and other language features. 

In another field, Chung-chieh Shan applied monad morphisms 
to natural language semantics. He used various monad morphisms 
to construct a modular theory of interrogatives, focus, intensional- 
ity and quantification (32). 

Monad transformers are well-supported by the Haskell pro¬ 
gramming language ED- For good tutorials, see Grabmueller Il2) 
andNewbum (25). 

8.2 Other representations of probability distributions 

A variety of techniques have been used to represent probability 
distributions. Fox and colleagues provide an excellent survey of 
Bayesian filtering and belief representations ffOll . Park and col¬ 
leagues also cover much of this ground in their paper on probability 
distribution monads (27). 

In robotics applications, probability distributions are often rep¬ 
resented as Kalman filters (3T|[lO). Kalman filters are highly effi¬ 
cient, but they can only represent simple Gaussian distributions. 
Under certain assumptions, however, Kalman filters are theoret¬ 
ically optimal Oil- Unfortunately, our probability monad tookit 
does not support Kalman filters, because Gaussian distributions 
cannot be defined over arbitrary Haskell data types. 

There is also an extensive literature on sequential Monte Carlo 
sampling and particle filters. For a good overview, see Fox and 
colleagues flOj . For a detailed discussion of current techniques, see 
Doucet and colleagues (6). 

8.3 Probabilistic programming languages 

Probabilistic programming languages are an enormously rich area 
of research. We describe a few here, just to give a sense of the sheer 
diversity. 

Pfeffer’s IBAL programming language supports sophisticated 
probabilistic inference (29). IBAL is a functional language, with 
constructs similar to those in this paper. But where our monads di¬ 
rectly calculate probability distributions, IBAL uses a two-phase 
inference algorithm to efficiently solve large problems. Phase 1 an¬ 
alyzes the program and produces an optimized computation graph. 
Phase 2 solves the computation graph and returns a probability dis¬ 
tribution over the possible outputs. 


Allison describes a Haskell library for machine learning and in¬ 
ductive inference (3]|4). Inductive inference constructs a Bayesian 
network from real-world data, filling in the connections automati¬ 
cally. The major drawback of this approach is the risk of “over¬ 
fitting” which occurs when the inference engine constructs a 
excessively-detailed model that describes every quirk of the train¬ 
ing data. To avoid this problem, Allison uses Minimum Message 
Length (MML) to choose a model minimizing the combined size 
of the model and the compressed training data. 

A rich variety of probabilistic logic programming languages 
have also appeared in the literature. Some examples include Ng 
and Subrahmanian (26), and Muggleton (24). 

9. Conclusion 

In this paper, we introduced a toolkit for building probability 
monads. Many components of the toolkit already existed in stan¬ 
dard Haskell, or in various proposed libraries. These included the 
list monad, the Monte Carlo sampling monad, and the MaybeT 
monad transformer. Two of our components, however, were new: 
The SMC monad (Section [71, which performs sequential Monte 
Carlo sampling, and the PerhapsT monad transformer (Section 
[5}, which is a stripped-down version of WriterT Prob. 

We also demonstrated some novel applications of this toolkit. 
These included an implementation of Bayes’ theorem using MaybeT 
(Section |6), a nd a monad which performs weighted particle filtering 
(Section |7.2| . 

The most important benefit of the modular toolkit, however, 
has been the ease with which we can construct new monads. This 
facilitates tinkering and experimentation, and frequently offers us 
new perspectives on well-known techniques. Many of the monads 
in this paper were discovered by asking, “Hey, what happens if 
we combine this with one of these?” We leave a great many such 
questions unanswered, pending further exploration. 
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